prio-queue: use cascade-down sift for faster extract-min by spkrka · Pull Request #2132 · gitgitgadget/git

spkrka · 2026-05-30T17:07:59Z

Hi, I am not sure whether this is just noise, but I thought it was at least
interesting.

I looked into the internals of prio_queue and found it was technically
doing too much work and could be simplified/optimized. I found I could
optimize it by ~20% for the common case (adding commits that would typically
end up far back in the queue) but only ~1% for the reverse case (adding
things to the front of the prio queue). The average speedup is somewhere in
between, I suppose. That said, this is not really the bottleneck, so the overall
improvement seems to be around 3-4% for repos with wide DAGs.

I would normally classify this as not urgent or important, but I think the
advantage is that the change is very small and simple and it already has good
unit tests (t/unit-tests/u-prio-queue.c).

With that said, here are the details:

The prio_queue_get implementation removes the root entry, moves the very
last element into the root slot, and then sifts it down into the right
place. At each level this takes one comparison between the two sibling
elements and one between the element being sifted and the smaller sibling.
It then uses swap operations to move things into place.

This patch instead promotes the smaller child upward at each level, leaving
a vacancy that sinks to a leaf, then places the former last element there
with a short sift-up to restore heap order.
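
The two phases described above can be sketched as follows. This is a minimal illustration on a hypothetical min-heap of plain ints; `struct int_heap` and `heap_extract_min` are names invented for this example and are not git's `struct prio_queue` or the code in the patch.

```c
#include <stddef.h>

/* Hypothetical minimal min-heap of ints (not git's struct prio_queue). */
struct int_heap {
	int *array;
	size_t nr;
};

/* Remove and return the minimum. The caller must ensure h->nr > 0. */
int heap_extract_min(struct int_heap *h)
{
	int min = h->array[0];
	int last = h->array[--h->nr];	/* element that must be re-placed */
	size_t hole = 0, child;

	if (!h->nr)
		return min;

	/* Phase 1: promote the smaller child into the hole until a leaf. */
	while ((child = 2 * hole + 1) < h->nr) {
		if (child + 1 < h->nr && h->array[child + 1] < h->array[child])
			child++;
		h->array[hole] = h->array[child];	/* one copy per level */
		hole = child;
	}

	/* Phase 2: sift the former last element up from the leaf hole. */
	while (hole) {
		size_t parent = (hole - 1) / 2;

		if (h->array[parent] <= last)
			break;
		h->array[hole] = h->array[parent];
		hole = parent;
	}
	h->array[hole] = last;
	return min;
}
```

Calling `heap_extract_min()` repeatedly on a valid heap yields the keys in ascending order. When the former last element is large, phase 2 stops after a single comparison.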

We can analytically compare this - for a sift-distance of d we can reason
about the number of operations to execute.

Before: 2d comparisons + 3d copies
After:   d comparisons +  d copies
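
To make the "Before" line concrete, here is a hedged sketch of a classic swap-based sift-down with operation counters. It works on plain ints, and `struct op_count` and `sift_down_classic` are hypothetical names for this illustration only; git's real code compares queue entries through a callback.

```c
#include <stddef.h>

/* Hypothetical tally of the work done by one sift-down. */
struct op_count {
	size_t cmps;
	size_t copies;
};

/* Classic sift-down of array[0] in a min-heap of nr ints. */
void sift_down_classic(int *array, size_t nr, struct op_count *ops)
{
	size_t ix = 0, child;

	while ((child = 2 * ix + 1) < nr) {
		int tmp;

		if (child + 1 < nr) {
			ops->cmps++;		/* left vs right child */
			if (array[child + 1] < array[child])
				child++;
		}
		ops->cmps++;			/* element vs smaller child */
		if (array[ix] <= array[child])
			break;

		tmp = array[ix];		/* a swap is three copies */
		array[ix] = array[child];
		array[child] = tmp;
		ops->copies += 3;
		ix = child;
	}
}
```

For the 7-element array {9, 2, 3, 4, 5, 6, 7}, the root sinks d = 2 levels and the counters end at 4 comparisons and 6 copies, matching 2d and 3d.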

After changing sift_down in this way, the replace operation can't simply
depend on it anymore, so I reimplemented it as a sequence of get + put.
This is technically correct but maybe not as efficient. However, I am not
sure that it matters, since I couldn't see any usage of the replace operation
in any hot path.
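
A rough sketch of replace expressed as get followed by put, again on a hypothetical int min-heap. The names `mini_queue`, `mq_put`, `mq_get` and `mq_replace` are invented here, the caller is assumed to provide enough array capacity, and treating an empty queue as a plain put is an assumption of this sketch rather than something taken from the patch.

```c
#include <stddef.h>

/* Hypothetical int min-heap; the caller supplies enough capacity. */
struct mini_queue {
	int *array;
	size_t nr;
};

/* Insert v by sifting it up from the end of the array. */
void mq_put(struct mini_queue *q, int v)
{
	size_t ix = q->nr++;

	while (ix) {
		size_t parent = (ix - 1) / 2;

		if (q->array[parent] <= v)
			break;
		q->array[ix] = q->array[parent];
		ix = parent;
	}
	q->array[ix] = v;
}

/* Cascade-style extract-min. The caller must ensure q->nr > 0. */
int mq_get(struct mini_queue *q)
{
	int min = q->array[0];
	int last = q->array[--q->nr];
	size_t hole = 0, child;

	if (!q->nr)
		return min;
	/* Sink the vacancy to a leaf by promoting the smaller child. */
	while ((child = 2 * hole + 1) < q->nr) {
		if (child + 1 < q->nr && q->array[child + 1] < q->array[child])
			child++;
		q->array[hole] = q->array[child];
		hole = child;
	}
	/* Sift the former last element up from the vacancy. */
	while (hole && q->array[(hole - 1) / 2] > last) {
		q->array[hole] = q->array[(hole - 1) / 2];
		hole = (hole - 1) / 2;
	}
	q->array[hole] = last;
	return min;
}

/* Replace the minimum with v: remove the root, then insert v. */
void mq_replace(struct mini_queue *q, int v)
{
	if (q->nr)
		mq_get(q);	/* assumption: an empty queue just gets a put */
	mq_put(q, v);
}
```

The design trade-off is the one described above: get plus put may do slightly more work than overwriting slot 0 and sifting down, but the result seen by callers is the same.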

Performance:
Profiling git rev-list --count on a 2.5M-commit monorepo shows sift_down_root
dropping from 8.2% to 0.4% of total runtime, effectively eliminated as
significant overhead.

Synthetic benchmark
10 rounds of 10M put+get cycles, CPU-pinned, median of 3 runs, same compiler
and Makefile flags.

Ascending keys (git's typical pattern -- parents have lower priority than
children):

queue width  baseline  patched  speedup
         10     4.32s    3.97s    1.09x
        100     7.95s    6.49s    1.23x
      1,000    11.30s    9.66s    1.17x
     10,000    16.34s   14.15s    1.16x
    100,000    21.43s   18.66s    1.15x

Descending keys (worst case: the last element always sinks to a leaf in both approaches):

queue width  baseline  patched  speedup
         10     4.84s    4.78s    1.01x
        100     9.43s    9.20s    1.03x
      1,000    15.28s   14.71s    1.04x
     10,000    23.61s   23.49s    1.01x
    100,000    29.16s   28.22s    1.03x

No regressions in any scenario.

End-to-end benchmarks

All benchmarks use 1 warmup run followed by 10 timed runs. Each
configuration is built from the same source tree and tested on the same
repo in alternating order.

linux kernel (1.4M commits) — range v5.0..v6.0 (311K commits):

Command                      baseline  patched  speedup
rev-list --count v5.0..v6.0     455ms    440ms    1.04x

I also ran it on git.git but did not see any performance difference at all,
due to its small size and narrow DAG.

The improvement scales with DAG width: wider DAGs produce larger priority
queues, amplifying the per-level savings. In small or narrow repositories
the priority queues stay shallow and the sift-down cost is already
negligible, so the change is not noticeable.

Replace the standard sift-down in prio_queue_get() with a cascade-down
approach.

The standard approach places the last array element at the root, then
sifts it down. At each level this requires two comparisons (left vs
right child, then element vs winner) and, when the element is larger,
a swap (three 16-byte copies).

The cascade approach instead promotes the smaller child into the vacant
root slot at each level: one comparison and one copy. The vacancy sinks
to a leaf, where the last array element is placed and sifted up if
needed, typically zero levels since the last array element tends to be
large.

In the common case, work per extract drops from 2d comparisons + 3d
copies to d comparisons + d copies: roughly half the comparisons and a
third of the data movement. The sift-up phase can add work when the
last element is smaller than ancestors of the leaf vacancy, but this is
rare in practice.

Simplify prio_queue_replace() to a plain get+put sequence. This is
semantically equivalent: the old implementation wrote to slot 0 and
sifted down, which has the same observable effect as removing the root
and inserting a new element. No caller observes queue state between the
two operations.

The previous implementation shared sift_down_root() with get, but the
cascade approach no longer accommodates that cleanly since
sift_down_root() now expects the element to reinsert at
queue->array[queue->nr], left there by prio_queue_get() after
decrementing nr. This is fine in practice: replace is only called from
pop_most_recent_commit() (fetch-pack, object-name, walker) and
show-branch, none of which appear in any hot path.

A synthetic benchmark (10 rounds of 10M put+get cycles, ascending
integer keys, CPU-pinned, median of 3 runs, same compiler and Makefile
flags) shows consistent improvement across all queue sizes, with no
regressions:

queue width  baseline  cascade  speedup
------------------------------------------------
         10     4.32s    3.97s    1.09x
        100     7.95s    6.49s    1.23x
      1,000    11.30s    9.66s    1.17x
     10,000    16.34s   14.15s    1.16x
    100,000    21.43s   18.66s    1.15x

With descending keys (worst case: the last element always sinks to a
leaf in both approaches) the cascade still wins slightly (1-4%) by
replacing swaps with copies, and never regresses.

In end-to-end git commands the improvement is modest because
sift_down_root is only ~8% of total runtime. Profiling rev-list --count
on a 2.5M-commit monorepo shows sift_down_root dropping from 8.2% to
0.4% of total runtime.

The improvement scales with DAG width: wider DAGs produce larger
priority queues, amplifying the per-level savings. In small or narrow
repos the queues stay shallow and the effect is negligible.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>

spkrka · 2026-05-31T17:56:22Z

/submit

gitgitgadget · 2026-05-31T17:57:27Z

Submitted as pull.2132.git.1780250236304.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2132/spkrka/cascade-sift-down-v1

To fetch this version to local tag pr-2132/spkrka/cascade-sift-down-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2132/spkrka/cascade-sift-down-v1

spkrka force-pushed the cascade-sift-down branch 8 times, most recently from a45f027 to 0a3a2b0 Compare May 31, 2026 08:20

spkrka force-pushed the cascade-sift-down branch from 0a3a2b0 to 9ca2fab Compare May 31, 2026 08:25

spkrka marked this pull request as ready for review May 31, 2026 17:56

spkrka wants to merge 1 commit into gitgitgadget:master from spkrka:cascade-sift-down
